home *** CD-ROM | disk | FTP | other *** search
- <html>
- <head> <title> Top Tips on programming in ARM Assembler </title> </head>
- <body>
-
- This page demonstrates some of the optimisations which are possible when
- programming in ARM assembler. It's based on my experience of optimising
- other people's code, so it's real examples of tricks that people overlook.
- This page assumes a familiarity with the ARM instruction set, and
- programming in assembler in general.
-
- <h1> Use of conditionals </h1>
-
- Conditional execution of instructions is one of the best things about the
- ARM instruction set, and yet people overlook it so often. For example,
- returning with the V bit set from a function is often used to indicate
- an error occurred:
-
- <pre>
- STMFD r13!, {r14}
- BL func1
- LDMVSFD r13!, {pc}
- ADD r0, r1, r2
- BL func2
- LDMVSFD r13!, {pc}
- SUB r1, r2, r0
- BL func3
- LDMVSFD r13!, {pc}
- LDMFD r13!, {pc}^
- </pre>
- can be rewritten as:
- <pre>
- STMFD r13!, {r14}
- BL func1
- ADDVC r0, r1, r2
- BLVC func2
- SUBVC r1, r2, r0
- BLVC func3
- LDMVCFS r13!, {pc}^
- LDMFD r13!, {pc}
- </pre>
-
- A saving of 8 bytes in space, and 3 instructions in execution time for the
- case when no errors occur - which is hopefully the more common! I also
- find it more readable and it's easier to trace the flow of execution.
- Naturally, if you have to do comparisons then this is not possible and
- you'll have to either exit or branch over the conditional segment of code.
- It can be more efficient to do this if you have a long procedure exit
- sequence, but normally exiting is merely a matter of unstacking the
- registers and writing the appropriate value to the program counter which
- is one instruction.
-
- <h1> Use of the correct comparison </h1>
-
- The different types of comparison available often confuse the novice
- programmer. Use of signed comparisons instead of unsigned is the most
- common error which is made, which can be merely inefficient (and
- <a href="#final">here's an example</a>), but often leads to bugs when
- memory ranges are suddenly greater than 2GB.
-
- <pre>
- Unsigned Signed
- HS (CS) GE
- HI GT
- LS LE
- LO (CC) LT
- </pre>
-
- PL and MI directly test bit 31 of the result (ie the N flag is a copy of
- bit 31). They are subtly different to GE and LT and I'd like someone to
- give me a case where they should be used instead of GE or LT.
-
- <h1> Don't misuse the stack </h1>
-
- Obviously, don't stack registers that aren't required to be preserved -
- which these are depends on whether you are writing APCS conforming code,
- or whether you have your own Procedure Call Standard, or whether you work
- on an ad hoc basis. But it's not necessary to stack r14 if the function
- is a leaf function (calls no other functions). However, if you do find
- yourself in need of more registers than you have available, then r14
- should be the first register you stack on the grounds of efficiency.
- Consider the two functions:
-
- <pre>
- STMFD r13!, {r14}
- MUL r0, r1, r2
- LDMFD r13!, {pc}
-
- STMFD r13!, {r1, r2}
- LDR r1, [r12, #8]
- LDR r2, [r12, #12]
- MUL r0, r1, r2
- LDMFD r13!, {r1, r2}
- MOVS pc, r14
- </pre>
-
- Better would be:
-
- <pre>
- MUL r0, r1, r2
- MOVS pc, r14
-
- STMFD r13!, {r14}
- LDR r14, [r12, #8]
- LDR r0, [r12, #12]
- MUL r0, r14, r0
- LDMFD r13!, {pc}
- </pre>
-
- In the second function, restoring registers is combined with returning
- from the procedure call. Since r14 must be corrupted by the function
- call, the caller is not expecting its value to be any particular value at
- the end of the function. This has been exploited in at least one program
- which I know of to return function values in r14, but you really need
- to be aware of exactly what you're doing with that technique, and I've
- never dared try it myself.
- <p>
- A further optimisation which can be used is to use a single store rather
- than a multiple store if you're only storing r14. The above code then
- becomes:
-
- <pre>
- STR r14, [r13, #-4]!
- LDR r14, [r12, #8]
- LDR r0, [r12, #12]
- MUL r0, r14, r0
- LDR pc, [r13], #4
- </pre>
-
- Beware that you can't tell LDR to restore the program counter flags.
- This is not an issue if you are using 32-bit APCS-3, but most routines
- on 26-bit ARMs preserve the flags over a function call.
-
- <p>
- It can sometimes be a win to only stack registers conditionally. Here's
- an example:
-
- <pre>
- LDR r2, [r12, #0]
- TEQ r0, r2
- MOVEQ pc, r14
- STMFD r13!, {r14}
- BL func
- SUB r0, r0, #1
- LDMFD r13!, {pc}
- </pre>
-
- This can be more efficient if r0 is often equal to r2; for example if this
- were an error check.
-
- <p>
- Another way in which the stack can be abused is to store data _below_ the
- stack pointer. This will appear to work but you will get random crashes.
- This is because interrupt code is entitled to use space on the stack
- temporarily; if you have left the stack pointer in the wrong place, you
- will have problems.
-
- <h1> References to data within a big program </h1>
-
- When your program gets to more than 4k, you start to run the risk of
- not being able to address data. This is because LDR has a maximum
- range of +/-4095 bytes. ADR (which is a pseudoinstruction for
- ADD rN, pc, x, which takes into account the fact that
- the program counter is 2 instructions ahead of the currently executing
- instruction) uses the normal 8 bits rotated by a multiple of 2 scheme,
- so this can get you further away but at the cost of not being able to
- directly access every address. Many people have FNadrl macros and at
- least two assemblers that I know of have ADRL instructions. They are
- implemented as something like:
-
- <pre>
- ADR rN, (offset AND &FF)
- ADD rN, ((offset - P%)AND &FF00)
- </pre>
-
- with slight modifications to cope with a negative offset, and the cunning
- ones actually try various possibilities such as
-
- <pre>
- ADR rN, (offset AND &3FC)
- ADD rN, ((offset - P%) AND &3FC00)
- </pre>
-
- to gain them more distance.
-
- <p>
- Nevertheless, these pseudo-instructions take 2 instructions, and ADRX
- (for constants which can't be reached with ADRL) takes 3. It is better
- to avoid using these if possible. Try moving data around so that it's
- nearer the functions which reference it. If that's not possible, try
- moving functions around.
-
- <p>
- If you have data which are used throughout the program, consider
- allocating a register to point to their address throughout the execution
- of the program. In RISC OS modules, this is assisted by Acorn by pointing
- r12 at a block of private workspace.
-
- <p>
- Under no circumstances consider doing the following:
-
- <pre>
- BL getvar
- ...
- .getvar
- LDR r0, var
- MOVS pc, r14
- .var
- EQUD 0
- </pre>
-
- since this involves two pipeline flushes and wrecks most caching
- strategies so it is guaranteed to be slower. It also goes against the
- principle of keeping data and code separate which is more important on
- StrongARM processors with their split instruction/data caches.
-
- <h1> Working with large constants </h1>
-
- By large constants, I don't mean constants with a large magnitude like 2^31,
- I mean constants which don't fit in the ARM's scheme for immediate constants;
- ie 8 bits rotated by a multiple of 2. This example is from IscaFS:
-
- <pre>
- LDR r0, [r6, #24]
- BIC r0, r0, #&ff000000
- BIC r0, r0, #&00ff0000
- MOV r1, r1, #&53
- ORR r1, r1, #&ef00
- TEQ r0, r1
- </pre>
-
- I replaced this with:
-
- <pre>
- LDR r0, [r6, #24]
- MOV r0, r0, LSL #16
- EOR r0, r0, #&ef000000
- TEQ r0, #&00530000
- </pre>
-
- Shifting r0 left by 16 automatically zeroes the leftmost 16 bits
- and discards the previous top 16 bits, which is the desired effect.
- The crucial step is noticing that TEQ is the same as EORS, without a
- destination register. So if the top byte of r0 is not &ef then the
- second test can never be true. This trick saves 2 instructions and
- one register.
-
- <p> A similar technique can apply to other situations, for example
- checking that a 16 bit value is less than a given value can be done
- by
-
- <pre>
- CMP r0, #&xy00
- CMPEQ r0, #&00za
- </pre>
-
- is the same as
-
- <pre>
- MOV r1, #&xy00
- ORR r1, #&00za
- CMP r0, r1
- </pre>
-
- but takes one instruction fewer and uses one register fewer.
- <p>
- If it's utterly unavoidable, you may need to put a large constant into a
- register. It is thought that there are no constants which cannot be
- produced in 3 instructions, though no-one has an algorithm for producing
- arbitrary constants (to the best of my knowledge). On a StrongARM, it
- is normally quicker to load an instruction instead of using 3 to synthesise
- it. On an ARM6, it is quicker to use 3 instructions. Take your pick.
-
- <h1> Strength reduction </h1>
-
- This term covers a number of optimisations. One is to reduce the cost of
- per-iteration calculations.
-
- <pre>
- MOV r0, #200
- MOV r4, #8
- .loop
- MUL r1, r0, r4
- ...
- SUBS r0, r0, #1
- BNE loop
- </pre>
-
- becomes
-
- <pre>
- MOV r1, #1600
- .loop
- ...
- SUBS r1, r1, #8
- BNE loop
- </pre>
-
- This saves a MUL instruction in the loop, which is a slow operation on many
- variants of the ARM. In this case, it also saves 2 registers, though in
- practice this is not often achieved.
-
-
- <h1> Count down, not up </h1>
-
- The above example also illustrates that it's better to count down in a loop
- than up. Here's a worse example:
-
- <pre>
- MOV r1, #0
- .loop
- ...
- ADD r1, r1, #8
- CMP r1, #1600
- BNE loop
- </pre>
-
- The SUBS used above combines the test-for-end with the loop-count, saving
- an instruction.
-
-
- <h1> Unrolling loops </h1>
-
- Since the instructions used per-loop are not doing useful work, it is often
- better to unroll the loop a little. This can add extra overhead in places,
- so you should be cautious. It may also pay to unroll the loop entirely, as
- the C compiler may sometimes do with memset. To reuse the example above:
-
- <pre>
- MOV r1, #1600
- .loop
- ...
- SUB r1, r1, #8
- ...
- SUBS r1, r1, #8
- BNE loop
- </pre>
-
- In this example, since I haven't specified what is in `...', it's not
- possible to combine the first SUB in with it which might be possible in
- real code. This optimistaion saves half an instruction (and a cache line flush) per loop
- - 800 in total. Against that, it takes more cache lines, and more RAM in
- general. The only way to find if this is a win or not is to benchmark the
- code in question.
-
-
- <h1> Dividing </h1>
-
- There are already several good algorithms out there and I
- don't think I have anything to contribute myself. Please see <a
- href="http://www.comlab.ox.ac.uk/oucl/users/robin.watts/Docs/Arc/Programmin/div-fast.html">
- This page</a> and <a
- href="http://www.comlab.ox.ac.uk/oucl/users/robin.watts/Docs/Arc/Programmin/div-short.html">
- this one</a> for good examples.
-
-
- <h1> Some further examples </h1>
-
- That's about all the coding tricks I can remember for the moment. If you
- want to see a real example of code which I've improved, try looking at
- IscaFS which was originally written by Phil Norman.
-
- <p>
-
- I'll leave you with some more fictional examples:
-
- <a name="final"></a>
- <pre>
- STMFD r13!, {r14}
- CMP r0, #5
- BGT not
- CMP r0, #0
- BLT not
- LDR r0, [r12, #8]
- B over
- .not
- MOV r0, #0
- SUB r0, r0, #1
- .over
- LDMFD r13!, {pc}
- </pre>
-
- Can be optimised to:
-
- <pre>
- CMP r0, #5
- LDRLS r0, [r12, #8]
- MVNHI r0, #0
- MOV pc, r14
- </pre>
-
- This example demonstrates several points:
- <ol>
- <li> Use of unsigned instead of signed comparisons
- <li> Use of conditionals to avoid branches
- <li> Not storing the link register if avoidable
- <li> Remembering one of the more `esoteric' instructions
- </ol>
-
- Thanks for reading, I appreciate feedback and if you have any tricks you'd
- like to share with the ARM programming community at large then please send
- them to me.
- <p>
- You probably want to look at these pages for further tips:
- <a href="http://www.comlab.ox.ac.uk/oucl/users/robin.watts/Docs/">Robin
- Watts' ARM programming page</a>
- <p>
- I'd like to thank Phil Norman and Peter Burwood for their help.
-
- <hr>
- <i><a href="mailto:willy@bofh.ai">Matthew Wilcox</a></i>
- </body>
- </html>
-